[MLAS] Add an NHWC implementation of convolution to avoid transposes#26834
orlmon01 wants to merge 81 commits into microsoft:main from
Conversation
…transposes

* Modification to the CPU EP to specify channels_last when the data format is NHWC
* Added a FusedNhwcConv kernel
* Implementation of the kernel in MLAS
* Added compiler guards so it is only used with KleidiAI (for now; can be removed if needed)
* Added unit tests

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
@microsoft-github-policy-service agree company="Arm"
Feedback appreciated, as this PR makes quite a lot of changes to the codebase well outside the normal KleidiAI scope.
Hi @orlmon01, I imagine that avoiding transposes also improves performance.

Hiya, I have some numbers somewhere from a Mobilenet model I was using for testing that I'll add in a bit, once I find / regenerate them. :)
Mobilenet model without the current patch:

Same model with changes:
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
* Update to the internal_testing_tests helper macros for file expansion so it works on other platforms
* Fix for the failing ConvDepthwiseFloat test: allow a small tolerance when running on different hardware
* Fix for the failing TestSaveAndLoadOrtModel test: make sure the model is saved / loaded from a writeable location
* Fix for an undeclared-identifier linker error
Pull request overview
This PR adds an NHWC (channels-last) implementation of convolution operations to avoid costly transpose operations in the CPU execution provider. The implementation includes KleidiAI-specific optimizations and a fallback path for NHWC convolutions.
Changes:
- Added NhwcFusedConv kernel for float32 convolutions in NHWC layout (KleidiAI-guarded)
- Implemented NHWC fast path and fallback path with explicit NHWC↔NCHW conversions in MLAS
- Extended test infrastructure to resolve paths dynamically and filter NHWC transformers in existing tests
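The fallback path mentioned above can be sketched roughly as follows. This is an illustrative standalone sketch, not the actual MLAS code, and the helper name `NhwcToNchw` is hypothetical: when no channels-last kernel is available, the NHWC input is converted to NCHW, the existing NCHW convolution runs, and the output is converted back.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch of the fallback conversion (hypothetical helper name).
// For a single image (N = 1): NHWC index is (h * W + w) * C + c,
// NCHW index is c * H * W + h * W + w.
std::vector<float> NhwcToNchw(const std::vector<float>& src,
                              size_t C, size_t H, size_t W) {
  std::vector<float> dst(src.size());
  for (size_t c = 0; c < C; ++c) {
    for (size_t h = 0; h < H; ++h) {
      for (size_t w = 0; w < W; ++w) {
        dst[c * H * W + h * W + w] = src[(h * W + w) * C + c];
      }
    }
  }
  return dst;
}
```

For example, with C = 2, H = 1, W = 2 the channel-interleaved NHWC buffer {1, 2, 3, 4} becomes the channel-planar NCHW buffer {1, 3, 2, 4}.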
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| onnxruntime/core/providers/cpu/nn/conv.h | Added channels_last_ flag to Conv kernel |
| onnxruntime/core/providers/cpu/nn/conv.cc | Implemented NHWC convolution logic with fast path and fallback |
| onnxruntime/core/optimizer/nhwc_transformer.cc | Added KleidiAI filter and FusedConv sum input handling |
| onnxruntime/core/mlas/lib/convolve.cpp | Added ChannelsLast parameter to MlasConvPrepare |
| onnxruntime/core/mlas/inc/mlas.h | Added ChannelsLast field to MLAS_CONV_PARAMETERS |
| onnxruntime/contrib_ops/cpu/fused_conv.cc | Registered NhwcFusedConv kernel |
| onnxruntime/test/optimizer/nhwc_transformer_test.cc | Added depthwise convolution test case |
| onnxruntime/test/optimizer/fuse_initializers_transformer_test.cc | Filtered NhwcTransformer in tests |
| onnxruntime/test/optimizer/conv_add_act_test.cc | Updated to handle both FusedConv variants |
| onnxruntime/test/internal_testing_ep/internal_testing_tests.cc | Added path resolution utilities |
| onnxruntime/test/framework/ort_model_only_test.cc | Added path resolution with diagnostic output |
| onnxruntime/core/util/math_cpu.cc | Added Im2col instantiation for float NHWC |
| onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp | Updated to support channels-last input |
…ons are 0
…r if needed
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

No pipelines are associated with this pull request.
Pull request overview
Copilot reviewed 35 out of 35 changed files in this pull request and generated 3 comments.
* NhwcFusedConv should now be available in minimal builds as the resolver bytes contain it
* RHS hashing now uses the full tensor to ensure uniqueness
* LHS no longer uses hashing as it's unnecessary
Can you please push the doc change too? I'll kick off CI after that.

Done. I hadn't seen your message when I was pushing the Copilot fix. :)
Pull request overview
Copilot reviewed 36 out of 36 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
onnxruntime/test/optimizer/nhwc_transformer_test.cc:1
- This helper treats stride == 0 as valid (it only rejects < 0) and will propagate it into the padding/output-shape computation, which can cause invalid behavior (stride must be > 0 in ONNX Conv). To keep the test-side capability predicate aligned with the production-side checks (and avoid accidental divide-by-zero paths), change the validation here (and the analogous dilations check) to reject <= 0.
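The suggested fix amounts to a predicate like the following. This is a standalone sketch with a hypothetical function name, not the actual test helper:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the suggested validation (hypothetical helper name): ONNX Conv
// requires strides and dilations to be strictly positive, so reject <= 0
// rather than only < 0, which would let a zero stride slip through into the
// padding / output-shape computation.
bool StridesAndDilationsArePositive(const std::vector<int64_t>& strides,
                                    const std::vector<int64_t>& dilations) {
  for (int64_t s : strides) {
    if (s <= 0) return false;  // previously only s < 0 was rejected
  }
  for (int64_t d : dilations) {
    if (d <= 0) return false;
  }
  return true;
}
```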
ORT_RETURN_IF_ERROR(
    kernel_type_str_resolver_utils::AddLayoutTransformationRequiredOpsToKernelTypeStrResolver(
        kernel_type_str_resolver));
Why do we need to add them to the saved ORT format model? They are already added at load time:

onnxruntime/onnxruntime/core/session/inference_session.cc, lines 1896 to 1900 in 8dd4a06:

#endif  // !defined(DISABLE_CONTRIB_OPS)
}
#if !defined(ORT_MINIMAL_BUILD) && !defined(DISABLE_CONTRIB_OPS) && defined(USE_KLEIDIAI)
Is this specific to USE_KLEIDIAI? NhwcFusedConv is added to the layout transformation ops even if USE_KLEIDIAI is not defined.
ASSERT_FALSE(resolved_args.empty());
}

TEST(KernelTypeStrResolverUtilsTest, SavedOrtModelResolverContainsNhwcFusedConv)
I think the ORT format model itself doesn't need to contain the layout transformation ops if we add them at load time. That decision was originally made to avoid unnecessarily increasing the model size.
Description
Currently, ONNX Runtime supports NCHW as the default data layout. For optimisations and kernels that operate better in NHWC layout, or where the data layout is NHWC in the first place, transposes are added around the layers. This patch seeks to eliminate them for convolutions where they would cause a performance decrease.
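To illustrate the layout difference behind those transposes (an illustrative sketch, not code from this patch): the two layouts store the same element at different offsets, so a kernel written for one layout needs a full data reshuffle to consume the other.

```cpp
#include <cassert>
#include <cstddef>

// Element offset of (n, c, h, w) in each layout. In NCHW each channel is a
// contiguous H*W plane; in NHWC (channels-last) the C values of one pixel
// are contiguous. A kernel expecting one layout needs a transpose to read
// the other, which is the copy this patch avoids for convolutions.
inline size_t OffsetNchw(size_t n, size_t c, size_t h, size_t w,
                         size_t C, size_t H, size_t W) {
  return ((n * C + c) * H + h) * W + w;
}
inline size_t OffsetNhwc(size_t n, size_t c, size_t h, size_t w,
                         size_t C, size_t H, size_t W) {
  return ((n * H + h) * W + w) * C + c;
}
```

For example, with C = 3, H = 4, W = 5, the element (0, 2, 1, 3) lives at offset 48 in NCHW but at offset 26 in NHWC.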
Motivation and Context
KleidiAI-specific implementation of this feature. Only regular convolutions are supported for now; depthwise is to follow. As a result, the filters are currently a little strict.